model compression
Linearly Decomposing and Recomposing Vision Transformers for Diverse-Scale Models
Vision Transformers (ViTs) are widely used across a variety of applications, but they usually have a fixed architecture that may not match the varying computational resources of different deployment environments. It is therefore necessary to adapt ViT architectures to devices with diverse computational budgets to achieve an accuracy-efficiency trade-off, a goal consistent with the motivation behind Learngene. To achieve this, inspired by polynomial decomposition in calculus, where a function can be approximated by a linear combination of several basic components, we propose to linearly decompose the ViT model into a set of components called learngenes during element-wise training. These learngenes can then be recomposed into pre-initialized models of different scales to satisfy different computational resource constraints. This decomposition-recomposition strategy provides an economical and flexible way to generate ViT models at different scales for different deployment scenarios. Compared to model compression or training from scratch, which require repeated training on large datasets for each model scale, our strategy reduces computational cost because it trains on the large dataset only once. Extensive experiments validate the effectiveness of our method: ViTs can be decomposed, and the resulting learngenes can be recomposed into ViTs of diverse scales that achieve comparable or better performance than traditional model compression and pre-training methods. The code for our experiments is available in the supplemental material.
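A minimal sketch of the recomposition idea, assuming each layer's weight is rebuilt as a linear combination of shared "learngene" components. The class name, component count, and mixture coefficients here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RecomposedLinear(nn.Module):
    """Linear layer whose weight is a learned mixture of K fixed components."""
    def __init__(self, components: torch.Tensor):  # shape (K, out_dim, in_dim)
        super().__init__()
        self.components = nn.Parameter(components, requires_grad=False)
        k = components.shape[0]
        self.coeffs = nn.Parameter(torch.randn(k) / k)  # mixture weights
        self.bias = nn.Parameter(torch.zeros(components.shape[1]))

    def forward(self, x):
        # Recomposed weight: W = sum_k c_k * B_k
        weight = torch.einsum("k,koi->oi", self.coeffs, self.components)
        return nn.functional.linear(x, weight, self.bias)

# Example: recompose a 64->64 projection from 4 hypothetical learngene components.
layer = RecomposedLinear(torch.randn(4, 64, 64))
out = layer(torch.randn(8, 64))
print(out.shape)  # torch.Size([8, 64])
```

Varying the number of components mixed per layer (or the number of layers built this way) is one plausible route to the diverse-scale models the abstract describes.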
HPM-KD: Hierarchical Progressive Multi-Teacher Framework for Knowledge Distillation and Efficient Model Compression
Haase, Gustavo Coelho, da Silva, Paulo Henrique Dourado
Knowledge Distillation (KD) has emerged as a promising technique for model compression but faces critical limitations: (1) sensitivity to hyperparameters requiring extensive manual tuning, (2) capacity gap when distilling from very large teachers to small students, (3) suboptimal coordination in multi-teacher scenarios, and (4) inefficient use of computational resources. We present \textbf{HPM-KD}, a framework that integrates six synergistic components: (i) Adaptive Configuration Manager via meta-learning that eliminates manual hyperparameter tuning, (ii) Progressive Distillation Chain with automatically determined intermediate models, (iii) Attention-Weighted Multi-Teacher Ensemble that learns dynamic per-sample weights, (iv) Meta-Learned Temperature Scheduler that adapts temperature throughout training, (v) Parallel Processing Pipeline with intelligent load balancing, and (vi) Shared Optimization Memory for cross-experiment reuse. Experiments on CIFAR-10, CIFAR-100, and tabular datasets demonstrate that HPM-KD: achieves 10x-15x compression while maintaining 85% accuracy retention, eliminates the need for manual tuning, and reduces training time by 30-40% via parallelization. Ablation studies confirm independent contribution of each component (0.10-0.98 pp). HPM-KD is available as part of the open-source DeepBridge library.
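A minimal sketch of an attention-weighted multi-teacher distillation loss in the spirit of component (iii). The shapes, the fixed temperature, and the softmax weighting head are assumptions for illustration; HPM-KD's actual modules (including its meta-learned temperature scheduler) are not shown here.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights, T=4.0):
    """student_logits: (B, C); teacher_logits_list: list of (B, C) tensors;
    weights: (B, num_teachers), per-sample weights summing to 1 per row."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    loss = 0.0
    for t, t_logits in enumerate(teacher_logits_list):
        p_teacher = F.softmax(t_logits / T, dim=-1)
        # Per-sample KL divergence, weighted by this teacher's attention weight.
        kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
        loss = loss + (weights[:, t] * kl).mean()
    return loss * (T * T)  # standard temperature scaling of the KD gradient

B, C = 16, 10
student = torch.randn(B, C, requires_grad=True)
teachers = [torch.randn(B, C) for _ in range(3)]
w = F.softmax(torch.randn(B, 3), dim=-1)  # e.g., output of a weighting head
print(multi_teacher_kd_loss(student, teachers, w).item())
```

Learning `w` from per-sample features, rather than fixing it, is what would make the ensemble weighting dynamic as the abstract describes.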
Rethinking Parameter Sharing as Graph Coloring for Structured Compression
Zhang, Boyang, Cheng, Daning, Zhang, Yunquan
Modern deep models have massive parameter counts, leading to high inference-time memory usage that limits practical deployment. Parameter sharing, a form of structured compression, effectively reduces redundancy, but existing approaches remain heuristic: they are restricted to adjacent layers and lack a systematic analysis of cross-layer sharing. Extending sharing across multiple layers, however, leads to an exponentially expanding configuration space, making exhaustive search computationally infeasible and forming a critical bottleneck for parameter sharing. We recast parameter sharing from a group-theoretic perspective as introducing structural symmetries into the model's parameter space. A sharing configuration can be described by a coloring function $\alpha: L \rightarrow C$ (L: layer indices; C: sharing classes), which determines inter-layer sharing groups while preserving structural symmetry. To determine the coloring function, we propose a second-order geometric criterion based on Taylor expansion and the Hessian spectrum. By projecting perturbations onto the Hessian's low-curvature eigensubspace, the criterion provides an analytic rule for selecting sharing groups that minimize performance impact, yielding a principled and scalable configuration procedure. Across diverse architectures and tasks, the resulting method, Geo-Sharing, consistently outperforms state-of-the-art heuristic sharing strategies, achieving higher compression ratios with smaller accuracy degradation.
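A minimal sketch of applying a coloring function alpha: L -> C to tie parameters across layers assigned the same sharing class. The hand-picked coloring below is purely illustrative; the paper selects the coloring with its Hessian-based second-order criterion, which is not reproduced here.

```python
import torch.nn as nn

def apply_coloring(layers, alpha):
    """layers: list of same-shaped nn.Linear; alpha: dict layer_idx -> class."""
    canonical = {}  # one representative layer per sharing class
    for i, layer in enumerate(layers):
        c = alpha[i]
        if c not in canonical:
            canonical[c] = layer  # first layer of each class keeps its weights
        else:
            # Tie parameters: both modules now hold the same Parameter objects.
            layer.weight = canonical[c].weight
            layer.bias = canonical[c].bias
    return layers

blocks = [nn.Linear(128, 128) for _ in range(6)]
alpha = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 2}  # 6 layers -> 3 sharing classes
apply_coloring(blocks, alpha)
assert blocks[0].weight is blocks[1].weight  # layers 0 and 1 share storage
```

With this coloring, only three distinct weight matrices are stored for six layers, which is the source of the compression; the quality of the configuration then hinges entirely on how `alpha` is chosen.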